New Splitting Criteria for Decision Trees in Stationary Data Streams.

Authors

  • Maciej Jaworski
  • Piotr Duda
  • Leszek Rutkowski
Abstract

The most popular tools for stream data mining are based on decision trees. Over the past 15 years, all proposed methods, headed by the very fast decision tree algorithm, have relied on Hoeffding's inequality, and hundreds of researchers have followed this scheme. Recently, we demonstrated that although Hoeffding decision trees are an effective tool for dealing with stream data, they are a purely heuristic procedure; for example, classical decision trees such as ID3 or CART cannot be adapted to data stream mining using Hoeffding's inequality. Therefore, there is an urgent need to develop new algorithms that are both mathematically justified and characterized by good performance. In this paper, we address this problem by developing a family of new splitting criteria for classification in stationary data streams and investigating their probabilistic properties. The new criteria, derived using appropriate statistical tools, are based on the misclassification error and the Gini index impurity measures. A general division of splitting criteria into two types is proposed. Attributes chosen based on type-$I$ splitting criteria guarantee, with high probability, the highest expected value of the split measure. Type-$II$ criteria ensure that the chosen attribute is, with high probability, the same as the one that would be chosen based on the whole infinite data stream. Moreover, two hybrid splitting criteria are proposed, which combine single criteria based on the misclassification error and the Gini index.
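As a rough illustration of the two impurity measures named in the abstract, the sketch below computes the Gini index and the misclassification error from per-class counts, and the resulting split gain for a candidate attribute. The function names, the example counts, and the `epsilon` threshold are assumptions introduced here for illustration; in particular, `epsilon` stands in for whatever bound a given splitting criterion supplies and is not the bound derived in the paper.

```python
from typing import Callable, Sequence

Counts = Sequence[int]


def gini(counts: Counts) -> float:
    """Gini index of a node, given the number of examples per class."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def misclassification_error(counts: Counts) -> float:
    """Fraction of examples outside the majority class."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - max(counts) / n


def split_gain(impurity: Callable[[Counts], float],
               parent: Counts,
               children: Sequence[Counts]) -> float:
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


def should_split(best_gain: float, second_gain: float, epsilon: float) -> bool:
    """Generic decision rule: split only if the best attribute beats the
    runner-up by more than epsilon (a hypothetical, caller-supplied bound)."""
    return best_gain - second_gain > epsilon


# Hypothetical class counts at a leaf and for two candidate binary attributes.
parent = [50, 50]
attr_a = [[40, 10], [10, 40]]
attr_b = [[30, 20], [20, 30]]

gain_a = split_gain(gini, parent, attr_a)
gain_b = split_gain(gini, parent, attr_b)
print(gain_a, gain_b, should_split(gain_a, gain_b, epsilon=0.1))
```

Passing `misclassification_error` instead of `gini` to `split_gain` evaluates the same candidate splits under the other impurity measure.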


Similar resources

Concept Drift in Decision Trees Learning from Data Streams

This paper presents the Ultra Fast Forest of Trees (UFFT) system. It is an incremental algorithm that works online, processing each example in constant time, and performing a single scan over the training examples. The system has been designed for numerical data. It uses analytical techniques to choose the splitting criteria, and the information gain to estimate the merit of each possible split...


Learning in Dynamic Environments: Decision Trees for Data Streams

This paper presents an adaptive learning system for the induction of a forest of trees from data streams that is able to detect concept drift. We have extended our previous work on the Ultra Fast Forest of Trees (UFFT) with the ability to detect concept drift in the distribution of the examples. The Ultra Fast Forest of Trees is an incremental algorithm that works online, processing each example in constant time, ...


Incremental Learning Algorithm for Dynamic Data Streams

Recent advances in hardware and software have enabled the capture of different measurements of data in a wide range of fields. These measurements are generated continuously and at highly fluctuating data rates. Examples include sensor networks, web logs, and computer network traffic. The storage, querying, and mining of such data sets are highly computationally challenging tasks. Mining...


A New Algorithm for Optimization of Fuzzy Decision Tree in Data Mining

Decision-tree algorithms provide one of the most popular methodologies for symbolic knowledge acquisition. The resulting knowledge, a symbolic decision tree along with a simple inference mechanism, has been praised for comprehensibility. The most comprehensible decision trees have been designed for perfect symbolic data. Classical crisp decision trees (DT) are widely applied to classification t...


UDC 519.95 Binary Decision Tree Synthesis: Splitting Criteria and the Algorithm ListBB

Interest in the class of inductors based on decision trees remains strong, especially in the context of the Data Mining paradigm. At the same time, the most widespread Quinlan algorithms, ID3 and C4.5, are, as we show in the paper, not the best. It is therefore possible to see successful attempts to create other heuristic splitting criteria for the algorithms of synthesis of ...



Journal:
  • IEEE Transactions on Neural Networks and Learning Systems


Publication date: 2017